During first ssh connection, wait for routable ip #2070

Closed
maximenoel8 wants to merge 28 commits into uyuni-project:master from maximenoel8:force_ipv4

@maximenoel8
Contributor

@maximenoel8 maximenoel8 commented Mar 12, 2026

Problem

During the provisioning phase of test deployments, OpenTofu/Terraform was intermittently failing with the error:
dial tcp [fe80::...]:22: connect: invalid argument

This occurred because the libvirt_domain resource often reports the IPv6 Link-Local address (fe80::/10) via the QEMU agent before the Global IPv6 or DHCP IPv4 addresses are fully assigned. Since Link-Local addresses are not reachable from the Jenkins worker, the connection fails.

Additionally, the previous logic could pick up IPv4 Link-Local (APIPA - 169.254.0.0/16) addresses, which would lead to connection timeouts.

Why the loop

To resolve the "empty host" race condition without relying on arbitrary sleep timers, a dynamic waiter has been introduced.

  • Transition to Dynamic Waiting: Replaced the static time_sleep with a local-exec polling loop. This loop queries the Libvirt QEMU agent until a routable (non-link-local) address is detected.
  • Why we still keep the Regex: While the waiter ensures a routable IP exists, the Libvirt metadata still returns a list containing all detected IPs (including fe80::). The regex remains necessary in the host field to explicitly select the routable address from that list and avoid the "Invalid Argument" error.
  • Efficiency: The provisioning phase now starts the moment the network is ready, reducing total deployment time in CI compared to a fixed sleep.
  • Fail-Fast: If a routable IP is not assigned within the timeout period (e.g., DHCP failure), the waiter exits with an error, providing a clear failure point in Jenkins rather than a generic SSH timeout.
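
A minimal sketch of the polling idea described in the list above, written in shell because the PR keeps this logic in wait_for_ip.sh. It is not the PR's actual script: the function name `wait_for_routable_ip`, its arguments, and the idea of passing the address lister as a command are assumptions made for illustration.

```shell
# Hypothetical sketch, not the PR's wait_for_ip.sh.
# Usage: wait_for_routable_ip TIMEOUT_SECONDS INTERVAL_SECONDS COMMAND [ARGS...]
# COMMAND must print one address per line (no CIDR suffix).
wait_for_routable_ip() {
  local timeout="$1" interval="$2" waited=0 addr
  shift 2
  while :; do
    # Keep the first address that is neither empty nor link-local.
    # "|| true" keeps a no-match result from aborting callers that use "set -e".
    addr="$("$@" 2>/dev/null | grep -Eiv '^($|fe80|169\.254\.)' | head -n 1)" || true
    if [ -n "$addr" ]; then
      printf '%s\n' "$addr"
      return 0
    fi
    # Fail fast with a clear message instead of a later generic SSH timeout.
    if [ "$waited" -ge "$timeout" ]; then
      echo "no routable IP after ${timeout}s" >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
}
```

The lister could be a small wrapper around `virsh domifaddr <domain> --source agent` that prints only the addresses of the wanted interface; the exact command used by the PR's script is not shown in this thread, so that part is left to the caller here.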

What this PR does

Updated the connection block in the terraform_data.provisioning resource to use a strict filter for the host attribute.

Logic: The new logic iterates through all addresses reported by the VM's first network interface and excludes any string starting with fe80 (IPv6 Link-Local) or 169.254 (IPv4 Link-Local).

Result: The provisioner now only attempts to connect to routable Global IPv6 or IPv4 addresses.

Safety: Removed the fallback to 127.0.0.1. If no routable address is found, the host now evaluates to null. This prevents the provisioner from "masking" the failure by attempting to SSH into the local runner/bastion host.
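
The selection rule above (skip fe80 and 169.254 prefixes, no fallback when nothing is left) can be sketched in shell as follows. This is an illustration only: the PR implements the filter as an expression in the host attribute, and the helper name `first_routable_ip` is hypothetical.

```shell
# Hypothetical helper, not code from the PR.
# Reads addresses from stdin (one per line, optional /prefix suffix) and
# prints the first one that is not link-local.
first_routable_ip() {
  local addr
  while IFS= read -r addr; do
    addr="${addr%%/*}"                # drop a CIDR suffix such as /64
    case "$addr" in
      '') continue ;;                 # blank line
      fe80*|FE80*) continue ;;        # IPv6 link-local
      169.254.*) continue ;;          # IPv4 link-local (APIPA)
      *) printf '%s\n' "$addr"; return 0 ;;
    esac
  done
  # Nothing routable: report failure rather than falling back to 127.0.0.1.
  return 1
}
```

Returning a non-zero status when only link-local addresses are present mirrors the PR's choice to let the host evaluate to null rather than mask the failure.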

16:06:02  │ Error: file provisioner error
16:06:02  │ 
16:06:02  │   with module.build_validation_module.module.server[0].module.server.module.host.terraform_data.provisioning[0],
16:06:02  │   on /home/jenkins/workspace/manager-4.3-qe-mi-validation-sles/results/sumaform/backend_modules/libvirt/host/main.tf line 278, in resource "terraform_data" "provisioning":
16:06:02  │  278:   provisioner "file" {
16:06:02  │ 
16:06:02  │ timeout - last error: dial tcp [fe80::a8b2:93ff:fe02:3d1]:22: connect:
16:06:02  │ invalid argument
16:06:02  ╵
script returned exit code 1

Depends on SUSE/susemanager-ci#1934

@maximenoel8 maximenoel8 requested a review from a team as a code owner March 12, 2026 03:50
type = "pty"
target_port = "0"
target_type = "serial"
source_host = null
Contributor Author

Why were those lines removed?

Answer:
They are redundant defaults that can occasionally cause validation warnings or clutter in modern OpenTofu/Terraform providers.

@srbarrios
Member

Instead of forcing IPv4, I would first like to understand and have an explanation of why using IPv6 fails in that particular environment.

Contributor

Copilot AI left a comment

Pull request overview

Adjusts the libvirt host provisioning logic to prefer an IPv4 address for the initial SSH/provisioner connection, addressing failures when Terraform attempts to connect via an IPv6 link-local (fe80::/10) address.

Changes:

  • Update connection.host selection to prioritize IPv4 addresses and avoid fe80:: link-local IPv6 addresses.
  • Consolidate multiple remote-exec provisioners into a single provisioner with sequential commands.
  • Minor cleanups/formatting adjustments in backend_modules/libvirt/host/main.tf.

Comment thread backend_modules/libvirt/host/main.tf Outdated
Comment thread backend_modules/libvirt/host/main.tf Outdated
Comment thread backend_modules/libvirt/host/main.tf
@Bischoff
Contributor

Bischoff commented Mar 12, 2026

Instead of forcing IPv4, I would first like to understand and have an explanation of why using IPv6 fails in that particular environment.

The Linux kernel would always prefer IPv6 over IPv4 when there is a choice, so it is expected to use IPv6 whenever possible. It should work.

I would like to understand better the situation too.

@Bischoff Bischoff self-requested a review March 12, 2026 09:49
Contributor

@Bischoff Bischoff left a comment

Please do not merge this, at least for now.

From what I see after some quick debugging, there is a problem at network level I need to solve.

@maximenoel8
Contributor Author

Instead of forcing IPv4, I would first like to understand and have an explanation of why using IPv6 fails in that particular environment.

I updated the PR description.

Comment thread backend_modules/libvirt/host/main.tf Outdated
@maximenoel8
Contributor Author

Moving it to draft, the new version is not working yet.

@maximenoel8 maximenoel8 marked this pull request as draft March 12, 2026 10:44
@maximenoel8
Contributor Author

Ok, the changes are working

@maximenoel8 maximenoel8 marked this pull request as ready for review March 12, 2026 10:48
@Bischoff
Contributor

Ok, the changes are working

Please change the title(s); we are not forcing IPv4 anymore.

@maximenoel8 maximenoel8 changed the title from "Force ipv4" to "During first ssh connection, wait for routable ip" Mar 12, 2026
@srbarrios srbarrios requested a review from Copilot March 13, 2026 06:05
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.


Comment thread backend_modules/libvirt/host/wait_for_ip.sh
Comment thread backend_modules/libvirt/host/get_ip.sh
Comment thread backend_modules/libvirt/host/main.tf
Comment thread backend_modules/libvirt/host/main.tf Outdated
Comment thread backend_modules/libvirt/host/wait_for_ip.sh Outdated
Comment thread backend_modules/libvirt/host/wait_for_ip.sh Outdated
@srbarrios
Member

srbarrios commented Mar 13, 2026

Instead of using external scripts, maybe we can use the same local-exec + qemu-monitor-command mechanism (but with a different call) that I used in my PR for the libvirt provider example.

See:

# Post-provisioning: Automatic SSH Port Forwarding
# Maps host port 2222 to guest port 22 via QEMU monitor command (HMP)
resource "null_resource" "auto_ssh_port" {
  # Ensure the VM is created before attempting port mapping
  depends_on = [libvirt_domain.ubuntu_vm]

  # Force execution on every 'apply' to maintain the mapping
  triggers = {
    always_run = timestamp()
  }

  provisioner "local-exec" {
    command = <<-EOT
      echo "Waiting for VM to initialize..."
      sleep 10
      virsh -c qemu:///session qemu-monitor-command ubuntu-vm --hmp "hostfwd_add hostnet0 tcp::2222-:22"
      echo "SSH access available at: ssh ubuntu@localhost -p 2222"
    EOT
  }
}

Comment thread backend_modules/libvirt/host/main.tf Outdated
@maximenoel8 maximenoel8 force-pushed the force_ipv4 branch 2 times, most recently from 1181955 to f10a0a8 Compare March 18, 2026 20:58
@maximenoel8
Contributor Author

Thanks for the reference @srbarrios!

I looked at your example in dmacvicar/terraform-provider-libvirt#1288. The qemu-monitor-command approach works well for simple port forwarding, but I think the use cases are different enough to justify keeping the external scripts here. Here's my reasoning:

| | Inline local-exec (your example) | External scripts (this PR) |
|---|---|---|
| Wait strategy | Static sleep 10 | Polling loop with retries |
| Failure behaviour | Silent (continues regardless) | Fail-fast with clear error |
| Readability | Fine for simple commands | Complex logic better kept out of HCL heredocs |
| Testability | Hard to test in isolation | Scripts can be run/debugged independently |

The core difference is that a static sleep isn't reliable enough for our CI environment: DHCP timing varies, and we need to know early if a routable IP never appears rather than getting a generic SSH timeout later.

Embedding a multi-step polling loop with grep/awk/virsh error handling inside an HCL heredoc would work, but it becomes difficult to read and debug. Keeping the logic in wait_for_ip.sh means it can be run standalone against any domain for troubleshooting.

@NamelessOne91
Contributor

NamelessOne91 commented Apr 16, 2026

I am not sure if this is strictly related, but I remember that when we updated the terraform-libvirt-provider we had similar issues around IPv6 and link-local addresses being used, especially for retail.

Technically, the provider we build in OBS should be patched to avoid considering an interface ready if that's the only available address.
See https://build.opensuse.org/projects/systemsmanagement:sumaform/packages/terraform-provider-libvirt/files/ipv6.patch?expand=1

Can you double check that:

  1. we are indeed using a provider binary installed from our repo and RPM, not one pulled from the public registry.
  2. whether that DEBUG log shows up and matches the IP causing trouble.

5 participants